Faster Set Intersection Algorithms for Text Searching. University of Waterloo Technical Report CS-2007-13

Authors

  • Jérémy Barbay
  • Alejandro López-Ortiz
  • Tyler Lu
  • Alejandro Salinger
Abstract

The intersection of large ordered sets is a common problem in the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection context. We experimentally compare them with the algorithms from the previous studies by Demaine, López-Ortiz and Munro [ALENEX 2001] and by Baeza-Yates and Salinger [SPIRE 2005]; in addition, we implement and test the intersection algorithm from Barbay and Kenyon [SODA 2002] and its randomized variant [SAGA 2003]. We consider the random data set from Baeza-Yates and Salinger, the Google queries used by Demaine et al., a corpus provided by Google, and a larger corpus from the TREC Terabyte 2006 efficiency query stream, along with its own query log. We measure performance both in terms of the number of comparisons and searches performed, and in terms of CPU time on two different architectures. Our results confirm or improve the results from both previous studies in their respective contexts (comparison model on real data and CPU measures on random data), and extend them to new contexts. In particular, we show that value-based search algorithms perform well on posting lists in terms of the number of comparisons performed.
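
As a rough illustration of the problem setting only, the sketch below intersects two sorted arrays by alternately searching each array for the current element of the other. It uses a generic galloping (doubling) search followed by binary search; this is a common textbook technique, assumed here for illustration, and is not claimed to be any of the algorithms proposed or evaluated in the report. The names `gallop` and `intersect` are hypothetical.

```python
from bisect import bisect_left


def gallop(arr, target, lo):
    """Return the first index i >= lo with arr[i] >= target (len(arr) if none).

    Assumes arr is sorted and every element before index lo is < target.
    """
    step = 1
    hi = lo
    # Probe at doubling distances until we reach an element >= target
    # or run off the end of the array.
    while hi < len(arr) and arr[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # The answer now lies in arr[lo:hi] or is hi itself; finish with binary search.
    return bisect_left(arr, target, lo, min(hi, len(arr)))


def intersect(a, b):
    """Intersect two sorted lists of distinct values (illustrative sketch)."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip ahead in a to the first element not smaller than b[j].
            i = gallop(a, b[j], i)
        else:
            # Skip ahead in b to the first element not smaller than a[i].
            j = gallop(b, a[i], j)
    return result


if __name__ == "__main__":
    print(intersect([1, 3, 4, 7, 9, 12], [2, 3, 7, 8, 12, 20]))
```

The search step is the part that varies between approaches: swapping `gallop` for a different search routine (for example one that guesses a position from the element values) changes the number of comparisons without changing the outer loop.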


Similar resources

Orthogonal Range Searching for Text Indexing

Text indexing, the problem in which one desires to preprocess a (usually large) text for future (shorter) queries, has been researched ever since the suffix tree was invented in the early 70’s. With textual data continuing to increase and with changes in the way it is accessed, new data structures and new algorithmic methods are continuously required. Therefore, text indexing is of utmost impor...


Fast Inverted Indexes with On-Line Update

Charles L. A. Clarke, Gordon V. Cormack, Forbes J. Burkowski. Dept. of Computer Science, University of Waterloo, Waterloo, Canada, N2L 3G1. Technical Report CS-94-40, November 23, 1994. Abstract: We describe data structures and an update strategy for the practical implementation of inverted indexes. The context of our discussion is the construction of a dedicated index engine for a distributed full-tex...


University of Waterloo Technical Report CS-2008-14: Adaptive two-dimensional string matching for protein contact maps

Contact maps are two-dimensional abstract representations of protein structures. One of the uses of contact maps is for the identification of patterns which correspond to some known configuration of protein secondary structures. In the past, searching for these patterns has generally used a naïve sliding window approach which is time consuming. We study several approaches that have been used fo...


Homothetic Polygons and Beyond: Intersection Graphs, Recognition, and Maximum Clique

We study the Clique problem in classes of intersection graphs of convex sets in the plane. The problem is known to be NP-complete in convex-set intersection graphs and straight-line-segment intersection graphs, but solvable in polynomial time in intersection graphs of homothetic triangles. We extend the latter result by showing that for every convex polygon P with sides parallel to k directions...


Faster Adaptive Set Intersections for Text Searching

The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we engineer a better algorithm for this task, which improves over those proposed by Demaine, Munro and López-Ortiz [SODA 2000/ALENEX 2001], by using a variant of interpolation search. More specifically, our contributions are threefold. First, we corrob...




Publication date: 2007